Abstract

We present a general framework for discriminative estimation based on the maximum en- 
tropy principle and its extensions. All calculations involve distributions over structures and/or 
parameters rather than specific settings and reduce to relative entropy projections. This holds 
even when the data is not separable within the chosen parametric class, in the context of 
anomaly detection rather than classification, or when the labels in the training set are uncer- 
tain or incomplete. Support vector machines are naturally subsumed under this class and we 
provide several extensions. We are also able to estimate exactly and efficiently discriminative 
distributions over tree structures of class-conditional models within this framework. Preliminary 
experimental results are indicative of the potential in these techniques. 

1 Introduction 

Effective discrimination is essential in many application areas including speech recognition, im- 
age classification or identification of molecular binding sites in genomic DNA. Statistical approaches 
used in these contexts for classification generally fall into two major categories - generative or dis- 
criminative - depending on the estimation criterion used for adjusting the model parameters and/or 
structure. Generative approaches rely on a full joint probability distribution over examples and clas- 
sification labels whereas for discriminative methods only the conditional relation of a label given the 
example is relevant. While the full joint distribution in the generative approach carries a number of 
advantages, e.g. in handling incomplete examples, the typical estimation criterion (maximum likelihood
or its variants) is nevertheless suboptimal from the point of view of the classification objective.
Discriminative methods such as support vector machines [21] or boosting algorithms [8] that focus 
directly on the parametric decision boundary typically yield more robust classification methods, 
whenever they are applicable. 

Full joint distributions and the benefits they convey can be, of course, exploited in discriminative 
approaches as well. We may, for example, interpret the posterior probability of a label given the
example as a parametric decision boundary (see e.g. [10, 13]). Alternatively, we can induce suitable
vector space representations for examples from generative models and feed such representations into
standard discriminative techniques [11].

In this paper we provide a more general notion of discrimination, one that applies also in the 
context of anomaly detection or when the classification labels themselves are uncertain or missing.
Note that the utility of e.g. unlabeled examples is not obvious [22, 2, 4, 18]. Our approach towards 
general discriminative training relies on the well known maximum entropy principle which embodies 
the Bayesian integration of prior information with observed constraints (see e.g. [15]). The formalism 
that we apply and extend in this paper allows, for example, a feasible discriminative training of both 
the parameters and the structure of a class of joint probability models. The approach is not limited 
to probability models, however, and we extend e.g. support vector machines. 

2 Maximum entropy classification 

Consider first a two-class classification problem where labels $y \in \{-1, 1\}$ are assigned to examples
$X \in \mathcal{X}$. Assume we have two class-conditional probability distributions over the examples, i.e.,
$P(X|\theta_y)$ with parameters $\theta_y$, one for each class. The decision rule corresponding to any particular
parameter setting $\{\theta_{\pm 1}\}$ follows the sign of the discriminant function:

$$\mathcal{L}(X|\Theta) = \log \frac{P(X|\theta_1)}{P(X|\theta_{-1})} + b \tag{1}$$

where $\Theta = \{\theta_1, \theta_{-1}, b\}$ and $b$ is a bias term, usually expressed as a log-ratio of prior class
probabilities, $b = \log p/(1 - p)$. The class-conditional distributions here may come from different families of
distributions or we might specify the parametric discriminant function directly without any reference
to probability models. The parameters $\theta_y$ may also include the model structure as seen later in the
paper.

The parameters $\Theta = \{\theta_1, \theta_{-1}, b\}$ in the discriminant function should be chosen to maximize
classification accuracy. Instead of finding a single parameter setting, we consider here the more
general problem of finding a distribution $P(\Theta)$ over the parameters and using a convex combination
of discriminant functions, i.e.,

$$\int P(\Theta)\, \mathcal{L}(X|\Theta)\, d\Theta \tag{2}$$

in place of the original discriminant function in the decision rule. The problem is now to find an
appropriate distribution $P(\Theta)$. Given a set of training examples $\{X_1, \ldots, X_T\}$ and corresponding
labels $\{y_1, \ldots, y_T\}$, we seek a distribution $P(\Theta)$ that makes the fewest assumptions about the choice
of the parameter values $\Theta$ while giving rise to a discriminant function that correctly separates the
training examples. We can formalize this as a maximum entropy (ME) estimation problem. In other
words, we maximize the entropy $H(P)$ of $P$ subject to the classification constraints

$$\int P(\Theta)\, [\, y_t \mathcal{L}(X_t|\Theta)\, ]\, d\Theta \geq \gamma \tag{3}$$

for all $t = 1, \ldots, T$. Here $\gamma$ specifies a desired classification margin. We note that the solution is
unique (provided that it exists) since $H(P)$ is concave and the linear constraints specify a convex
region. Note that the preference towards high entropy distributions (fewer assumptions) applies
only within the admissible set of distributions $\mathcal{P}_\gamma$ consistent with the classification constraints.

We can readily extend this formulation to a multi-class setting by introducing additional classification
constraints. To see this, suppose we have instead $m$ class-conditional probability models
$P(X|\theta_y)$, $y = 1, \ldots, m$, prior class frequencies $\{p_y\}$, and the associated pairwise discriminant
functions

$$\mathcal{L}_{y,y'}(X|\Theta) = \log \frac{P(X|\theta_y)}{P(X|\theta_{y'})} + \log \frac{p_y}{p_{y'}} \tag{4}$$

where $\Theta = \{\theta_1, \ldots, \theta_m, p_1, \ldots, p_m\}$. We may now replace the single constraint per training example
in eq. (3) with the following $m - 1$ pairwise constraints

$$\int P(\Theta)\, [\, \mathcal{L}_{y_t,y}(X_t|\Theta)\, ]\, d\Theta \geq \gamma, \quad y \neq y_t, \tag{5}$$

to ensure that the training label $y_t$ always "wins" the competition against the alternative labels
$y \neq y_t$. For notational simplicity we will consider primarily only binary classification problems in
the remainder of the paper but emphasize that the analogous extension to a multi-class setting can
be made.

The overall ME formulation presented so far has several problems. We have, for example, made
a tacit assumption that the training examples can be separated with the specified margin. This
assumption may very well be violated in practice. Moreover, we may have a prior reason to prefer
some parameter values over others (as well as margin constraints), which requires us to incorporate a
prior distribution $P_0(\Theta, \gamma)$ into the definition. Other extensions and generalizations will be discussed
later in the paper.

A more general formulation that addresses these concerns is given by the following minimum 
relative entropy principle: 

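Definition 1 Let $\{X_t, y_t\}$ be the training examples and labels, $\mathcal{L}(X|\Theta)$ a parametric discriminant
function, and $\gamma = [\gamma_1, \ldots, \gamma_T]$ a set of margin variables. Assuming a prior distribution $P_0(\Theta, \gamma)$, we
find the discriminative minimum relative entropy (MRE) distribution $P(\Theta, \gamma)$ by minimizing

$$D(P\|P_0) = \int P(\Theta, \gamma) \log \frac{P(\Theta, \gamma)}{P_0(\Theta, \gamma)}\, d\Theta\, d\gamma \tag{6}$$

subject to the (soft) classification constraints

$$\int P(\Theta, \gamma)\, [\, y_t \mathcal{L}(X_t|\Theta) - \gamma_t\, ]\, d\Theta\, d\gamma \geq 0 \tag{7}$$

for all $t$. The decision rule for any new example $X$ is given by

$$\hat{y} = \mathrm{sign}\left( \int P(\Theta)\, \mathcal{L}(X|\Theta)\, d\Theta \right) \tag{8}$$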



Let us make a few remarks about the definition. First, we can recover the previous ME formulation
by appropriately adjusting the prior distribution $P_0(\Theta, \gamma)$ (e.g., if $P_0(\gamma)$ peaks around a specific
setting of the margins). The margin constraints are thus hidden in the prior distribution
$P_0(\gamma)$. Second, if we assume that there is a non-zero prior probability for all $\gamma_t$ taking some negative
values, we guarantee that the admissible set $\mathcal{P}$, composed of all distributions $P(\Theta, \gamma)$ consistent with
the classification constraints, is never empty. Thus even when the examples cannot be separated
by any discriminant function in the chosen parametric class (e.g. linear), we get a valid and unique
solution. Third, the penalty for violating any of the margin constraints also depends on the prior
distribution $P_0$; whenever the mean of $\gamma_t$ deviates from its prior mean under $P_0$, we incur a penalty
in the form of relative entropy distance between the corresponding distributions. It is worth noting
that the penalties are defined in terms of joint specifications of margins but, in certain cases, they
reduce to the more typical additive penalties of violating the constraints.

Figure 1: Minimum relative entropy (MRE) projection from the prior distribution to the admissible
set.

The prior $P_0(\Theta, \gamma)$ plays an important role in our definition and we must choose it appropriately.
Let us consider here only the prior over the margin constraints $\gamma$. Supposing again that
$P_0(\Theta, \gamma) = P_0(\Theta) P_0(\gamma)$, we can, for example, set

$$P_0(\gamma) = \prod_t P_0(\gamma_t) \tag{9}$$

where $P_0(\gamma_t) = c\, e^{-c(1 - \gamma_t)}$ for $\gamma_t \leq 1$. A penalty is incurred for margins smaller than $1 - 1/c$ (the
prior mean of $\gamma_t$) while margins larger than this are not penalized. In the latter case, the associated
constraint simply becomes irrelevant. We will see in later sections that this choice of the margin
prior corresponds closely to the use of slack variables and additive penalties in support vector
machines. A number of other choices for $P_0(\gamma)$ are possible and we discuss some of them later in
the paper.
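
The stated properties of this margin prior can be checked numerically. The following is a minimal sketch in plain Python and is not part of the original formulation: the function names, the truncated integration range and the midpoint rule are illustrative assumptions. It checks that the prior mean is $1 - 1/c$ and that $-\log E[e^{-\lambda \gamma}] = \lambda + \log(1 - \lambda/c)$ for $0 \leq \lambda < c$, which is the potential term that the exponential margin prior contributes to the dual objective.

```python
import math

def margin_prior_pdf(gamma, c):
    """P_0(gamma) = c * exp(-c * (1 - gamma)) for gamma <= 1, and 0 otherwise."""
    return c * math.exp(-c * (1.0 - gamma)) if gamma <= 1.0 else 0.0

def expect(f, c, lo=-40.0, hi=1.0, n=100_000):
    """Midpoint-rule approximation of E[f(gamma)] under the margin prior.

    The lower limit `lo` truncates the (negligible) left tail; it is an
    assumption of this sketch, adequate for c of order 1 or larger.
    """
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        g = lo + (i + 0.5) * h
        total += f(g) * margin_prior_pdf(g, c)
    return total * h

def potential(lam, c):
    """Closed form of -log E[exp(-lam * gamma)], valid for 0 <= lam < c."""
    return lam + math.log(1.0 - lam / c)

if __name__ == "__main__":
    c, lam = 5.0, 2.0
    print("prior mean:", expect(lambda g: g, c), "expected:", 1.0 - 1.0 / c)
    numeric = -math.log(expect(lambda g: math.exp(-lam * g), c))
    print("potential:", numeric, "closed form:", potential(lam, c))
```

The closed form diverges to minus infinity as $\lambda$ approaches $c$, which is how this prior bounds the corresponding Lagrange multiplier.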

An important property of the MRE solution is that it can be viewed as a relative entropy
projection, the e-projection in the terminology of [1], from the prior distribution $P_0(\Theta, \gamma)$ to the
admissible set $\mathcal{P}$. Figure 1 illustrates this view. Even in the non-separable case, we can view the
MRE solution as a projection. This formalism readily extends to the case of uncertain or partially
labeled examples as we will see later in the paper.

To solve the MRE problem, we rely on the following theorem. 

Theorem 1 The solution to the MRE problem has the following general form (cf. [7]):

$$P(\Theta, \gamma) = \frac{1}{Z(\lambda)}\, P_0(\Theta, \gamma)\, e^{\sum_t \lambda_t [\, y_t \mathcal{L}(X_t|\Theta) - \gamma_t\, ]} \tag{10}$$

where $Z(\lambda)$ is the normalization constant (partition function) and $\lambda = \{\lambda_1, \ldots, \lambda_T\}$ defines a set
of non-negative Lagrange multipliers, one for each classification constraint. The multipliers $\lambda$ are set by
finding the unique maximum of the following jointly concave objective function:

$$J(\lambda) = -\log Z(\lambda) \tag{11}$$



Whether the MRE solution can be found in a feasible way depends entirely on whether we can
evaluate the partition function $Z(\lambda)$,

$$Z(\lambda) = \int P_0(\Theta, \gamma)\, e^{\sum_t \lambda_t [\, y_t \mathcal{L}(X_t|\Theta) - \gamma_t\, ]}\, d\Theta\, d\gamma \tag{12}$$

in closed form. Given a closed form expression for $Z(\lambda)$, the maximum of the jointly concave objective
function $J(\lambda)$ can be subsequently found through any standard convex optimization method
such as Newton-Raphson. The resulting set of Lagrange multipliers $\{\lambda_t\}$ then defines the MRE
solution as indicated in the theorem. Finally, predicting a label for any new example $X$ involves averaging
the discriminant function $\mathcal{L}(X|\Theta)$ with respect to the marginal $P(\Theta)$ of the MRE distribution
(see Definition 1). Finding this marginal as well as performing the required averaging are no more
costly than computing $Z(\lambda)$. We will elaborate on these calculations in the context of specific
realizations.
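
To make the roles of $Z(\lambda)$ and $J(\lambda)$ concrete, the following minimal sketch solves a toy MRE problem in plain Python. It rests on several assumptions that are not in the text: the parameter is a scalar $\theta$ restricted to a small finite grid (so the integral becomes a sum), the discriminant is $\mathcal{L}(X|\theta) = \theta X$ without bias, the margins are fixed at $\gamma_t = 1$, and projected gradient ascent stands in for Newton-Raphson. All names (`mre_posterior`, `fit_mre`, the toy data) are hypothetical.

```python
import math

def mre_posterior(lam, thetas, prior, xs, ys, gamma):
    """P(theta) proportional to P_0(theta) * exp(sum_t lam_t * (y_t * theta * x_t - gamma))."""
    logw = [math.log(p0) + sum(l * (y * th * x - gamma) for l, x, y in zip(lam, xs, ys))
            for th, p0 in zip(thetas, prior)]
    top = max(logw)                       # subtract the maximum for numerical stability
    w = [math.exp(v - top) for v in logw]
    z = sum(w)
    return [v / z for v in w]

def fit_mre(thetas, prior, xs, ys, gamma=1.0, steps=5000, lr=0.05):
    """Maximize J(lam) = -log Z(lam) over lam >= 0 by projected gradient ascent."""
    lam = [0.0] * len(xs)
    for _ in range(steps):
        post = mre_posterior(lam, thetas, prior, xs, ys, gamma)
        for t, (x, y) in enumerate(zip(xs, ys)):
            # dJ/dlam_t = -E_P[y_t * L(X_t|theta) - gamma]: a violated constraint raises lam_t.
            slack = sum(p * (y * th * x - gamma) for p, th in zip(post, thetas))
            lam[t] = max(0.0, lam[t] - lr * slack)
    return lam, mre_posterior(lam, thetas, prior, xs, ys, gamma)

# Toy problem: uniform prior over a grid of slopes, three one-dimensional examples.
THETAS = [-2.0 + 0.5 * i for i in range(9)]
PRIOR = [1.0 / len(THETAS)] * len(THETAS)
XS, YS = [1.0, 2.0, -1.0], [1, 1, -1]
LAM, POST = fit_mre(THETAS, PRIOR, XS, YS)
MEAN_THETA = sum(p * th for p, th in zip(POST, THETAS))

if __name__ == "__main__":
    print("multipliers:", LAM)
    print("posterior mean of theta:", MEAN_THETA)
```

In this toy problem the constraint for the second example is satisfied with room to spare, so its multiplier is driven to zero, while the other two constraints are met with equality.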

The MRE solution is sparse in the sense that only a few Lagrange multipliers will be non-zero.
This arises because many of the classification constraints become irrelevant once the constraints are
enforced for a small subset of examples. For support vector machines, which are subsumed under the
above general definition, this notion translates into a sparse representation of the separating
hyperplane. Sparsity leads to immediate generalization guarantees (independent of the dimensionality of
the parameter or example space):

Lemma 1 The generalization error $\epsilon_g$ of the MRE classifier satisfies

$$\epsilon_g \leq E\{\, \text{fraction of non-zero Lagrange multipliers}\, \} \tag{13}$$

where the expectation is over the choice of the training set.

Practical leave-one-out cross-validation estimates of the generalization error can be derived on 
the basis of this result (cf. [21, 12]). We may also make use of generalization error results derived 
for convex combination of classifiers [20] to obtain more informative generalization bounds for MRE 
classifiers. The details are left for another paper. 

3 Practical realization of the MRE solution 

We now turn to the question of actually finding the MRE solution. Consider first the following
elementary but helpful lemma.

Lemma 2 Any factorization of the prior $P_0(\Theta, \gamma)$ across any disjoint sets of variables $\{\Theta, \gamma\}$ leads
to a disjoint factorization of the MRE solution $P(\Theta, \gamma)$ across the same sets of variables provided
that these variables appear in distinct additive components in $y_t \mathcal{L}(X_t|\Theta) - \gamma_t$.

If we assume that the labels $\{y_t\}$ are fixed and that the prior distribution $P_0(\Theta, \gamma)$ factorizes
across the components $\{\Theta \setminus b, b, \gamma\}$, then according to the lemma, the MRE solution factorizes in the
same way. This factorization property allows us to eliminate e.g. the bias term from the remaining
solution by means of imposing additional constraints on the Lagrange multipliers. This is analogous
to the handling of the bias term in support vector machines [21]:

Lemma 3 Assuming $P_0(\Theta, \gamma) = P_0(\Theta \setminus b, \gamma) P_0(b)$ and $P_0(b)$ approaches a non-informative prior,
then $P(\Theta, \gamma) = P(\Theta \setminus b, \gamma) P(b)$ and $P(\Theta \setminus b, \gamma)$ can be found independently from $P(b)$ provided that
we require $\sum_t \lambda_t y_t = 0$.

With the help of these results, we now consider a few specific realizations such as support
vector machines and a class of graphical models.

Figure 2: Three margin prior distributions (top row) and the corresponding potential terms (bottom
row) from Eq. (15).



3.1 Support vector machines 

It is well known that the log-likelihood ratio of two Gaussian distributions with equal covariance
matrices yields a linear decision rule. With a few additional assumptions, the MRE formulation
gives support vector machines:

Theorem 2 Assuming $\mathcal{L}(X|\Theta) = \theta^T X - b$ and $P_0(\Theta, \gamma) = P_0(\theta) P_0(b) P_0(\gamma)$ where $P_0(\theta)$ is $N(0, I)$,
$P_0(b)$ approaches a non-informative prior, and $P_0(\gamma)$ is given by eq. (9), then the Lagrange multipliers
$\lambda$ are obtained by maximizing $J(\lambda)$ subject to $0 \leq \lambda_t \leq c$ and $\sum_t \lambda_t y_t = 0$, where

$$J(\lambda) = \sum_t [\, \lambda_t + \log(1 - \lambda_t/c)\, ] - \frac{1}{2} \sum_{t,t'} \lambda_t \lambda_{t'} y_t y_{t'} (X_t^T X_{t'}) \tag{14}$$

The only difference between our $J(\lambda)$ and the (dual) optimization problem for SVMs is the
additional potential term $\log(1 - \lambda_t/c)$. This highlights the effect of the different misclassification
penalties, which in our case come from the MRE projection. Figures 2(a) and 2(c) show, however,
that the additional potential term does not always carry a huge effect (for $c = 5$). Moreover, in the
separable case, letting $c \to \infty$, the two methods coincide. The decision rules are formally identical.
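
The relation between the two dual objectives can be illustrated with a short sketch in plain Python. The function names `svm_dual` and `mre_dual` and the two-point data set are hypothetical and serve only to show that the objective of Eq. (14) equals the standard SVM dual plus the potential terms, and that the potential terms vanish as $c$ grows.

```python
import math

def svm_dual(lam, X, y):
    """Standard SVM dual: sum_t lam_t - 1/2 sum_{t,t'} lam_t lam_t' y_t y_t' <X_t, X_t'>."""
    quad = 0.0
    for lt, xt, yt in zip(lam, X, y):
        for ls, xs, ys in zip(lam, X, y):
            quad += lt * ls * yt * ys * sum(a * b for a, b in zip(xt, xs))
    return sum(lam) - 0.5 * quad

def mre_dual(lam, X, y, c):
    """Objective of Eq. (14): the SVM dual plus the potential terms log(1 - lam_t / c).

    Only defined for 0 <= lam_t < c; the logarithm acts as a barrier at lam_t = c.
    """
    return svm_dual(lam, X, y) + sum(math.log(1.0 - lt / c) for lt in lam)

if __name__ == "__main__":
    X = [(1.0, 0.0), (0.0, 1.0)]
    y = [1, -1]
    lam = [0.5, 0.5]          # satisfies sum_t lam_t * y_t = 0
    print("SVM dual:      ", svm_dual(lam, X, y))
    print("MRE dual, c=5: ", mre_dual(lam, X, y, 5.0))
    print("MRE dual, c=1e9:", mre_dual(lam, X, y, 1e9))
```

Because the two objectives differ only in the potential terms, any routine that maximizes the SVM dual under the same constraints can in principle be adapted by adding them.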

The choice of the prior distribution $P_0(\gamma)$ leads to different potential terms. Figure 2 shows the
following priors and their corresponding potential terms:

     Margin prior                                                Dual potential term
a)   $P_0(\gamma) \propto e^{-c(1-\gamma)}$, $\gamma \leq 1$     $\lambda_t + \log(1 - \lambda_t/c)$
b)   $P_0(\gamma) \propto e^{-c|1-\gamma|}$                      $\lambda_t + 2\log(1 - \lambda_t/c)$
c)   $P_0(\gamma) \propto e^{-c^2(1-\gamma)^2/2}$                $\lambda_t - (\lambda_t/c)^2$
                                                                                                  (15)

where a) is the case discussed in the theorem. Note that the resulting potential terms may or may
not set an upper bound on the value of $\lambda_t$. In a) and b) $\lambda_t$ is bounded by the constant $c$ whereas
in c) no such bound exists.



3.1.1 Extension 

We now consider the case where the discriminant function $\mathcal{L}(X|\Theta)$ corresponds to the log-likelihood
ratio of two Gaussians with different (and adjustable) covariance matrices. The parameters
$\Theta$ in this case are both the means and the covariances. The prior $P_0(\Theta)$ must be the conjugate
Normal-Wishart to obtain closed form integrals (see footnote 1) for the partition function $Z$. Here, $P(\Theta_1, \Theta_{-1})$
is $P(m_1, V_1) P(m_{-1}, V_{-1})$, a density over means and covariances (the factorization follows from
our assumptions below).

The prior distribution has the form $P_0(\Theta_1) = \mathcal{N}(m_1; m_0, V_1/k)\, \mathcal{IW}(V_1; kV_0, k)$ with parameters
$(k, m_0, V_0)$ that can be specified manually, or one may let $k \to 0$ to get a non-informative prior. We
used the MAP values for $k$, $m_0$ and $V_0$ from the class-specific data (see footnote 2). Integrating over the
parameters and the margin, we get a partition function which factorizes as $Z = Z_\gamma \times Z_1 \times Z_{-1}$. For $Z_1$
we obtain the following:

$$Z_1 \propto N_1^{-d/2}\, |\pi S_1|^{-N_1/2} \prod_{j=1}^{d} \Gamma\!\left( \frac{N_1 + 1 - j}{2} \right) \tag{16}$$

$$N_1 = \sum_t w_t, \qquad \bar{X}_1 = \sum_t \frac{w_t}{N_1} X_t, \qquad S_1 = \sum_t w_t X_t X_t^T - N_1 \bar{X}_1 \bar{X}_1^T \tag{17}$$

Here, $w_t$ is a scalar weight given by $w_t = u(y_t) + y_t \lambda_t$ for $Z_1$. To solve for $Z_{-1}$ we proceed in a
similar manner, except that the weights are set to $w_t = u(-y_t) - y_t \lambda_t$; here $u(\cdot)$ is the
step function. Given $Z$, updating $\lambda$ is done by maximizing the corresponding negative log-partition
function $J(\lambda)$ subject to $0 \leq \lambda_t \leq c$ and $\sum_t \lambda_t y_t = 0$, where:

$$J(\lambda) = \sum_t [\, l_\alpha \lambda_t + \log(1 - \lambda_t/c)\, ] - \log Z_1(\lambda) - \log Z_{-1}(\lambda) \tag{18}$$

The potential term above corresponds to integrating over the margin with a margin prior
$P_0(\gamma) \propto e^{-c(l_\alpha - \gamma)}$ with $\gamma \leq l_\alpha$. We pick $l_\alpha$ to be some $\alpha$-percentile of the margins obtained
under the standard MAP solution. Optimal $\lambda$ values are found via constrained gradient descent. The resulting
marginal MRE distribution over the parameters (normalized by the partition function $Z_1 \times Z_{-1}$) is
itself a Normal-Wishart distribution, $P(\Theta_1) = \mathcal{N}(m_1; \bar{X}_1, V_1/N_1)\, \mathcal{IW}(V_1; S_1, N_1)$, with the final $\lambda$
values. Predicting the labels for a data point $X$ under the final $P(\Theta)$ involves taking expectations
of the discriminant function under a Normal-Wishart. This is simply:

$$E_{P(\Theta_1)}[\log P(X|\Theta_1)] = \text{constant} - \frac{N_1}{2} (X - \bar{X}_1)^T S_1^{-1} (X - \bar{X}_1) \tag{19}$$

We thus obtain discriminative quadratic decision boundaries. These extend the linear boundaries
without (explicitly) resorting to kernels. Of course, kernels may still be used in this formalism,
effectively mapping the feature space into a higher dimensional representation. However, unlike
linear discrimination, the covariance estimation in this framework allows the model to adaptively
modify the kernel.
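
The statistics of Eq. (17) are weighted versions of the usual class count, mean and scatter matrix. A minimal sketch in plain Python follows; the helper names `step` and `weighted_stats` and the toy data are hypothetical. With all $\lambda_t = 0$ the weights reduce to class-membership indicators and the statistics coincide with the ordinary per-class ones, whereas non-zero multipliers add weight to examples of the class itself and subtract weight from examples of the other class.

```python
def step(v):
    """Step function u(v): 1 for positive arguments, 0 otherwise."""
    return 1.0 if v > 0 else 0.0

def weighted_stats(X, y, lam, cls=1):
    """Return (N, mean, S) of Eq. (17) for class `cls` (+1 or -1).

    Weights: w_t = u(y_t) + y_t * lam_t for cls = +1,
             w_t = u(-y_t) - y_t * lam_t for cls = -1.
    """
    w = [step(cls * yt) + cls * yt * lt for yt, lt in zip(y, lam)]
    d = len(X[0])
    N = sum(w)
    mean = [sum(wt * xt[i] for wt, xt in zip(w, X)) / N for i in range(d)]
    S = [[sum(wt * xt[i] * xt[j] for wt, xt in zip(w, X)) - N * mean[i] * mean[j]
          for j in range(d)] for i in range(d)]
    return N, mean, S

if __name__ == "__main__":
    X = [(0.0, 0.0), (2.0, 0.0), (5.0, 5.0)]
    y = [1, 1, -1]
    print(weighted_stats(X, y, [0.0, 0.0, 0.0], cls=1))   # ordinary class +1 statistics
    print(weighted_stats(X, y, [0.0, 0.0, 0.5], cls=1))   # the negative example now has weight -0.5
```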

3.1.2 Experiments 

In the following, we show results using the minimum relative entropy approach where the discriminant
function $\mathcal{L}(X|\Theta)$ is the log-ratio of Gaussians with variable covariance matrices on
standard 2-class classification problems (Leptograpsus Crabs and Breast Cancer Wisconsin). In
addition we display a two-dimensional visualization example of the classification. Performance is
compared to regular support vector machines, maximum likelihood estimation and other methods.

1 This can be done more generally for conjugate priors in the exponential family.

2 The prior here is the posterior distribution over the parameters given the data, i.e. an empirical Bayes procedure.

Method                             Training errors   Test errors
Neural Network (1)
Neural Network (2)
Linear Discriminant                                  8
Logistic Regression
MARS (degree = 1)                                    4
PP (4 ridge functions)
Gaussian Process (HMC)
Gaussian Process (MAP)
SVM - Linear                       5                 3
SVM - RBF $\sigma$ = 0.3           1                 18
SVM - 3rd Order Polynomial         3                 6
Maximum Likelihood Gaussians       4                 7
MaxEnt Discrimination Gaussians    2                 3

Table 1: Leptograpsus Crabs

Figure 3: ROC curves on Leptograpsus Crabs, (a) Training ROC and (b) Testing ROC, for discriminative
(solid line), Bayes / ML models (dashed line) and SVM linear models (dotted line).

The Leptograpsus crabs data set was originally provided by Ripley [19] and further tested by 
Barber and Williams [3]. The objective is to classify the sex of the crabs from 5 scalar anatomical 
observations. The training set contains 80 examples (40 of each sex) and the test set includes 120 
examples. 

The Gaussian based decision boundaries are compared in Table 1 against other models from [3]. 
The table shows that the maximum entropy (or minimum relative entropy) criterion improves the 
Gaussian discrimination performance to levels similar to the best alternative models. The bias was 
estimated separately from training data for both the maximum likelihood Gaussian models and the 
maximum entropy discrimination case. In addition, we show the performance of a support vector 
machine (SVM) with linear, radial basis and polynomial decision boundaries (using the Matlab 
SVM Toolbox provided by Steve Gunn). In this case, the linear SVM is limited in flexibility while 
kernels exhibit some over-fitting. 

In Figure 3 we plot the ROC curves on training and testing data. The ROC curves show improved
classification for the maximum entropy (minimum relative entropy) case.

Method                             Training errors   Test errors
Nearest Neighbour                                    11
SVM - Linear                                         10
SVM - RBF $\sigma$ = 0.3                             11
SVM - 3rd Order Polynomial                           13
Maximum Likelihood Gaussians       10                16
MaxEnt Discrimination Gaussians    3                 8

Table 2: Breast Cancer Classification

Figure 4: ROC curves on Breast Cancer, (a) Training ROC and (b) Testing ROC, for discriminative
(solid line), Bayes / ML models (dashed line) and SVM linear models (dotted line).



Another data set which was tested was the Breast Cancer Wisconsin data where the two classes 
(malignant or benign) have to be computed from 9 numerical attributes from the patients (200 
training cases and 169 test cases). The data was first presented by Wolberg [24]. We compare our 
results to those produced by Zhang [25] who used a nearest neighbour algorithm to achieve 93.7% 
accuracy. As can be seen from Table 2, over-fitting seems to prevent good performance for kernel 
based SVMs. The maximum entropy discriminator achieves 95.3% accuracy. 

In Figure 4 we plot the ROC curves on training and testing data. The training ROC curves
show improved discrimination for the maximum entropy method. The ROC curves for all three methods
are equivalent on testing; however, since we typically assume that the bias is estimated exclusively from
training data, the results in Table 2 are more significant.

Finally, for visualization, we present the technique on a 2D set of training data in Figure 5 and
Figure 6. The SVM in Figure 5(a) attempts to achieve maximum discrimination but is limited to a
linear decision boundary. It only succeeds after the application of a kernel as in Figure 5(b), where
a 3rd order polynomial kernel is used. In Figure 6(a), the maximum likelihood technique is used
to estimate a two-Gaussian discrimination boundary (the bias is estimated separately), which has more
flexibility than the linear SVM yet fails to achieve the desired optimal classification. Meanwhile,
the maximum entropy discrimination technique places the Gaussians in the most discriminative
configuration, as shown in Figure 6(b).

Figure 5: Classification visualization for SVMs. (a) Linear SVM. (b) Polynomial Kernel SVM.

Figure 6: Classification visualization for Gaussian discrimination. (a) Max Likelihood. (b) Max Ent Discrimination.



3.2 The Fisher kernel classifier 

Here we demonstrate that the MRE formulation proposed in this paper contains the Fisher kernel
method of [11]. The Fisher kernel method provides a combination of a generative model $P(X|\theta)$
with a discriminative method such as support vector machines through defining an appropriate
kernel function. The kernel function, called the Fisher kernel, can be computed from any generative
model in the neighborhood of some desired (e.g. maximum likelihood) parameter setting $\theta^*$. The
Fisher kernel function is given by

$$K_{fk}(X, X') = U_X(\theta^*)^T F(\theta^*)^{-1} U_{X'}(\theta^*) \tag{20}$$

where $U_X(\theta^*)$ is the Fisher score

$$U_X(\theta^*) = \nabla_\theta \log P(X|\theta) \big|_{\theta = \theta^*}, \tag{21}$$

$F(\theta) = E\{ U_X(\theta) U_X(\theta)^T \}$ is the Fisher information matrix (see footnote 3) and the expectation is
with respect to $P(X|\theta)$. Replacing the inner product $X_t^T X_{t'}$ between the examples in Theorem 2 with the kernel
function in Eq. (20) amounts to the "simple" Fisher kernel method as explained in [11].
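
As a small worked instance of Eqs. (20) and (21), consider a Bernoulli model $P(X|\theta) = \theta^X (1 - \theta)^{1 - X}$ with $X \in \{0, 1\}$. This model and the function names below are chosen here purely for illustration and do not come from [11]. The Fisher score is $(X - \theta)/(\theta(1 - \theta))$ and the Fisher information, computed in the sketch as the expected squared score, equals $1/(\theta(1 - \theta))$.

```python
def fisher_score(x, theta):
    """U_X(theta) = d/dtheta log P(x|theta) for a Bernoulli model with x in {0, 1}."""
    return (x - theta) / (theta * (1.0 - theta))

def fisher_information(theta):
    """F(theta) = E[U_X(theta)^2], with the expectation taken under P(X|theta)."""
    return sum(p * fisher_score(x, theta) ** 2 for x, p in ((1, theta), (0, 1.0 - theta)))

def fisher_kernel(x, x_prime, theta):
    """K_fk(X, X') = U_X(theta) * F(theta)^{-1} * U_X'(theta) for a scalar parameter."""
    return fisher_score(x, theta) * fisher_score(x_prime, theta) / fisher_information(theta)

if __name__ == "__main__":
    theta_star = 0.25
    for pair in ((1, 1), (1, 0), (0, 0)):
        print(pair, fisher_kernel(pair[0], pair[1], theta_star))
```

For a scalar parameter the kernel is a rescaled product of scores; for vector parameters the division becomes multiplication by the inverse Fisher information matrix.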

Our goal in this section is to show that we can recover the Fisher kernel method in the MRE
framework so long as the prior distribution $P_0(\Theta, \gamma)$ is chosen in an appropriate way. We start with a
few necessary regularity assumptions about the family of distributions $P(X|\theta)$ in some small (open)
neighborhood $O(\theta^*)$ of $\theta^*$:

1. for any $X \in \mathcal{X}$, $U_X(\theta) = \nabla_\theta \log P(X|\theta)$ is a continuously differentiable vector valued function
of $\theta$

2. $F(\theta) = E\{ U_X(\theta) U_X(\theta)^T \}$ exists and is positive definite

Let us define, in addition, the differential (symmetric) relative entropy distance between the
distributions $P(X|\theta)$ and $P(X|\theta^*)$

$$d(\theta, \theta^*)^2 = \frac{1}{2} (\theta - \theta^*)^T F(\theta^*)^{-1} (\theta - \theta^*) \tag{22}$$

valid whenever $\theta \approx \theta^*$. We assign a prior distribution $P_0(\theta)$ in terms of this distance (see footnote 4)

$$P_0(\theta) \propto e^{-\beta\, d(\theta, \theta^*)^2} \tag{23}$$

where $\beta$ serves as a scaling parameter. This prior assigns a low probability to all $\theta$ for which the
corresponding probability distribution $P(X|\theta)$ deviates significantly from $P(X|\theta^*)$. Another way to
view this prior is as a local isotropic Gaussian prior distribution in the probability manifold induced
by the family of distributions $P(X|\theta)$, $\theta \in O(\theta^*)$.

In the MRE formalism the objective is to minimize the relative entropy distance between the
MRE distribution $P$ and the prior $P_0$ subject to the classification constraints

$$\int P(\Theta, \gamma)\, [\, y_t \mathcal{L}(X_t|\Theta, \beta) - \gamma_t\, ]\, d\Theta\, d\gamma \geq 0 \tag{24}$$

where the discriminant function $\mathcal{L}(X_t|\Theta, \beta)$ is the scaled log-likelihood ratio:

£(X t \QJ) = [P^\o g ^^-b] (25) 

and $\Theta = \{\theta, b\}$. This discriminant function encourages parameter values $\theta$ that are indicative of the
$+1$ class relative to the "null model" $P(X_t|\theta^*)$.

The following theorem now establishes the desired connection to the Fisher kernel method.

Theorem 3 If we replace $P_0(\theta)$ with Eq. (23) in Theorem 2 and the discriminant function with
$\mathcal{L}(X_t|\Theta, \beta)$ defined above, and let $\beta \to \infty$, then the objective function $J(\lambda)$ reduces to

$$J(\lambda) = \sum_t [\, \lambda_t + \log(1 - \lambda_t/c)\, ] - \frac{1}{2} \sum_{t,t'} \lambda_t \lambda_{t'} y_t y_{t'} K_{fk}(X_t, X_{t'}) \tag{26}$$

where $K_{fk}(X_t, X_{t'})$ is the Fisher kernel of Eq. (20).

We note that this result is merely a formal relation between the MRE principle and the Fisher
kernel and does not necessarily provide any additional motivation.

3 For many probability distributions it may not be possible to compute the Fisher information matrix in closed form.
However, it is the covariance matrix of the Fisher scores and thus can be easily approximated by sampling.

4 A more precise definition of this prior would involve setting it to zero outside the open neighborhood where the
regularity conditions may no longer hold. For large $\beta$, the effect of this condition vanishes and we omit it here for
simplicity.

3.3 Graphical models 

The MRE formulation can accommodate discriminant functions resulting from log-ratios of general
graphical models. The MRE distribution, i.e. $P(\Theta)$, in this setting is over both the parameters
and the structure of the model. Since the estimation is carried out in the space of distributions,
the distinction between discrete or continuous variables is immaterial. The framework does not,
however, admit efficient solutions without restrictions on the class of graphical models. For example,
assuming the structure remains fixed and that the class-conditional models have no latent variables,
the MRE distribution $P(\Theta)$ over the parameters can be obtained efficiently. This requires
additional technical assumptions such as the use of conjugate priors, the parameter independence
assumption of [6] and the fact that the probability model must be tractable for any fixed setting
of the parameters. Although restricted, this class does include e.g. naive Bayes models, mixture of
tree models and so on.

For a special class of graphical models whose structure is a tree, both the parameters and the 
structure can be estimated efficiently within our discriminative framework. In the remainder, we 
will consider such tree structured models. 

First, we define a tree distribution. Let $V$ denote the set of variables of interest, $|V| = n$, $x_v \in \mathcal{X}_v$
a particular value of $v \in V$, and $X \in \mathcal{X}$ an assignment to all the variables in $V$. Like any graphical
model, a tree distribution is defined in two stages. First, one defines a graph $(V, E)$, called the structure,
whose vertices are the variables in $V$ and whose edges encode dependencies between these variables.
A tree is an undirected graph over $V$ that is connected and has no cycles. For any tree over $n$
vertices, $|E| = n - 1$. Because such a tree spans all the nodes in $V$, it is often called a spanning
tree. Second, the tree distribution is defined as a product of factors corresponding to the edges and
vertices,

$$T(X) = \frac{\prod_{(u,v) \in E} T_{uv}(x_u, x_v)}{\prod_{v \in V} T_v(x_v)^{\deg v - 1}} \tag{27}$$

where $\deg v$ is the degree of vertex $v$, i.e. the number of edges incident to $v \in V$, and $T_{uv}$ and $T_v$
denote the marginals of $T$:

$$T_{uv}(x_u, x_v) = \sum_{X:\, u = x_u,\, v = x_v} T(X), \qquad T_v(x_v) = \sum_{X:\, v = x_v} T(X).$$

When the variables are discrete, the marginals $T_{uv}$ and $T_v$ can be represented as probability tables
denoted respectively $\theta_{uv}(x_u, x_v)$ and $\theta_v(x_v)$. The values $\theta$ are the parameters of the distribution.
When it is necessary to emphasize the dependence of the tree distribution on its structure and
parameters we will use the notation $T(X|E, \theta)$.

By taking the logarithm of $T(X)$ and conveniently grouping the factors one obtains

$$\log T(X) = \underbrace{\sum_{v \in V} \log T_v(x_v)}_{w_0(X)} + \sum_{(u,v) \in E} \underbrace{\log \frac{T_{uv}(x_u, x_v)}{T_v(x_v)\, T_u(x_u)}}_{w_{uv}(X)} = w_0(X) + \sum_{(u,v) \in E} w_{uv}(X). \tag{28}$$

In words, the log-likelihood is a sum of terms $w_{uv}(X)$, each corresponding to an edge (and depending
only on the values of the variables $u, v$ associated with that edge), plus a structure-independent term
$w_0(X)$ that depends on all the variables. All the terms are functions of the tree parameters $\theta$.
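
The equivalence between the product form of Eq. (27) and the edge-wise decomposition of Eq. (28) can be checked on a small example. The sketch below, in plain Python, uses a three-variable binary chain whose marginal tables were chosen here by hand to be mutually consistent; the tables and the function names are illustrative assumptions.

```python
import itertools
import math

# A chain 0 - 1 - 2 over binary variables with mutually consistent marginal tables.
EDGES = [(0, 1), (1, 2)]
SINGLE = {0: [0.6, 0.4], 1: [0.5, 0.5], 2: [0.65, 0.35]}
PAIR = {(0, 1): [[0.42, 0.18], [0.08, 0.32]],
        (1, 2): [[0.45, 0.05], [0.20, 0.30]]}

def tree_prob(x, edges, pair, single):
    """T(X) as in Eq. (27): product of edge marginals over vertex marginals^(deg - 1)."""
    deg = {v: 0 for v in single}
    num = 1.0
    for u, v in edges:
        num *= pair[(u, v)][x[u]][x[v]]
        deg[u] += 1
        deg[v] += 1
    den = 1.0
    for v, d in deg.items():
        den *= single[v][x[v]] ** (d - 1)
    return num / den

def tree_logprob_edgewise(x, edges, pair, single):
    """log T(X) as in Eq. (28): w_0(X) plus one term w_uv(X) per edge."""
    w0 = sum(math.log(single[v][x[v]]) for v in single)
    w_edges = sum(math.log(pair[(u, v)][x[u]][x[v]] / (single[u][x[u]] * single[v][x[v]]))
                  for u, v in edges)
    return w0 + w_edges

if __name__ == "__main__":
    total = 0.0
    for x in itertools.product((0, 1), repeat=3):
        p = tree_prob(x, EDGES, PAIR, SINGLE)
        total += p
        print(x, p, math.exp(tree_logprob_edgewise(x, EDGES, PAIR, SINGLE)))
    print("sum over all assignments:", total)
```

The edge-wise form is what makes a distribution over structures tractable: changing the edge set only changes which $w_{uv}$ terms are included, while $w_0$ stays fixed.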
3.3.1 Discriminative learning of tree structures

A tree model is defined by a set of discrete variables encoding its structure and a set of continuous
variables representing its parameters. To use the MRE framework we must define a prior joint distribution
over the structures and their associated parameters. We will assume that the structure and
the parameters are independent a priori; moreover, we shall assume that except for the functional
dependencies among the parameters that are imposed by the fact that they have to represent a valid
joint distribution over $X$ there are no other statistical or functional dependencies. These assumptions
correspond to the parameter independence and parameter modularity assumptions of [9] (see also [6]).
In our case, this means that there is a set of parameters $\theta = \{\theta_{uv}(i, j),\ u, v \in V,\ i \in \mathcal{X}_u,\ j \in \mathcal{X}_v\}$
associated with the edges such that in any tree model containing an edge $uv \in E$, the pairwise
marginals $T_{uv}(x_u, x_v)$ are given by $\theta_{uv}(x_u, x_v)$ regardless of the presence of other edges in $E$ and
their parameter values. This simplification, in turn, allows the MRE formulation for structures only
(with a fixed set of parameters or a fixed distribution over their values), for parameters only, or for
both.

We start with an MRE estimation of structures only, where the pairwise marginals $\theta_{uv}(x_u, x_v)$ are
assumed fixed. Note that each tree nevertheless makes use of a different set of $n - 1$ edges and
thereby a different set of parameters. For each class or label $s \in \{1, -1\}$, we have a separate set of
fixed parameters $\theta^s$. In the experiments below, the values of these parameters were obtained from
empirical (class-conditional) marginals. We assume a uniform prior over the class-conditional tree
structures $E_s$.

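Definition 2 Given a set $(X_t, y_t)$, $t = 1, \ldots, T$ of labeled examples, a set of margin variables
$\gamma = [\gamma_1, \ldots, \gamma_T]$, and a prior distribution $P_0(E_1, E_{-1}, \gamma)$, the MRE distribution
$P(E_1, E_{-1}, \gamma)$ is the one minimizing $D(P \| P_0)$ subject to

\[ \sum_{E_1, E_{-1}} \int P(E_1, E_{-1}, \gamma) \left[ y_t \log \frac{T(X_t \mid E_1, \theta^1)}{T(X_t \mid E_{-1}, \theta^{-1})} - \gamma_t \right] d\gamma \ge 0 \quad \text{for } t = 1, \ldots, T. \tag{29} \]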



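Assuming $P_0(E_1, E_{-1}, \gamma) = P_0(E_1) P_0(E_{-1}) P_0(\gamma)$, Lemma 2 implies that the solution is factored as
$P(E_1) P(E_{-1}) P(\gamma)$ with

\[ P(E_s) = \frac{1}{Z_s} \exp\Big( \sum_{t=1}^{T} s \lambda_t y_t \Big[ w_0^s(X_t) + \sum_{uv \in E_s} w_{uv}^s(X_t) \Big] \Big) = \frac{W_0^s}{Z_s} \prod_{uv \in E_s} W_{uv}^s \tag{30} \]

for $s = 1, -1$ and

\[ W_0^s = e^{\sum_{t} s \lambda_t y_t w_0^s(X_t)}, \qquad W_{uv}^s = \prod_{t=1}^{T} \big( e^{w_{uv}^s(X_t)} \big)^{s \lambda_t y_t}, \qquad s = 1, -1. \tag{31} \]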

In the above, the normalization constants $Z_s$ and the factors $W^s$ are functions of the Lagrange
multipliers $\lambda$, which need to be set. Provided that we can obtain the normalization constants
(functions) $Z_s$ in closed form, $\lambda$ is set to maximize the dual objective

\[ J(\lambda) = \gamma^T \lambda - \log Z_1 - \log Z_{-1}, \tag{32} \]

where, for simplicity, we have assumed a fixed setting of the margin variables $\{\gamma_t\}$.

3.4 Computing the normalization constant and its derivatives 

The number of all possible tree structures over $n$ vertices is $n^{n-2}$ [23] and thus computing the
normalization constants by enumerating all the tree structures is clearly not possible for reasonable
$n$. However, a remarkable graph theory result enables us to perform all the necessary summations
in closed form in polynomial time. This is the Matrix Tree Theorem quoted below.

Theorem 4 (Matrix Tree Theorem) [23] Let $G = (V, E)$ be a multigraph and denote by $a_{uv} =
a_{vu} \ge 0$ the number of undirected edges between vertices $u$ and $v$. Then the number of all spanning
trees of $G$ is given by $|A|_{uv} (-1)^{u+v}$, the value of the determinant obtained from the following matrix
by removing row $u$ and column $v$ (see footnote 5).

\[ A = \begin{pmatrix}
\deg(v_1) & -a_{12} & -a_{13} & \cdots & -a_{1n} \\
-a_{21} & \deg(v_2) & -a_{23} & \cdots & -a_{2n} \\
\vdots & & & \ddots & \vdots \\
-a_{n1} & -a_{n2} & -a_{n3} & \cdots & \deg(v_n)
\end{pmatrix} \tag{33} \]



By extending the Matrix Tree theorem to continuous-valued $A$ and letting the weights $W_{uv}$ play
the role of $a_{uv}$, one can prove

Theorem 5 Let $P(E)$ be a distribution over tree structures defined by

\[ P(E) \propto W_0 \prod_{uv \in E} W_{uv}. \tag{34} \]

Then its normalization constant $Z$ is equal to

\[ Z = W_0 \sum_{E} \prod_{uv \in E} W_{uv} = W_0\, |Q(W)| \tag{35} \]

with $Q(W)$ being the $(n - 1) \times (n - 1)$ matrix

\[ Q_{uv}(W) = Q_{vu}(W) = \begin{cases} -W_{uv} & 1 \le u < v \le n - 1 \\ \sum_{v'=1}^{n} W_{v'v} & 1 \le u = v \le n - 1 \end{cases} \tag{36} \]



This shows that summing over the distribution of all trees, when this distribution factors according
to the trees' edges, can be done in closed form by computing the value of an order $n - 1$ determinant,
an operation that takes $O(n^3)$ time.
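
Theorem 5 is easy to check by brute force for small $n$. The following is a minimal sketch, assuming a complete graph on $n = 4$ vertices with arbitrary positive rational weights (the weights and all function names are made up for illustration). It compares the determinant $|Q(W)|$ with the explicit sum over all $n^{n-2} = 16$ spanning trees, using exact rational arithmetic.

```python
import itertools
from fractions import Fraction


def build_q(weights, n):
    """(n-1) x (n-1) matrix Q(W) of eq. (36); weights[(u, v)] is given for u < v (0-based)."""
    def w(u, v):
        return Fraction(weights[(min(u, v), max(u, v))])
    q = [[Fraction(0)] * (n - 1) for _ in range(n - 1)]
    for u in range(n - 1):
        for v in range(n - 1):
            if u == v:
                # Diagonal: total weight of all edges touching v, including the dropped vertex.
                q[u][v] = sum(w(k, v) for k in range(n) if k != v)
            else:
                q[u][v] = -w(u, v)
    return q


def determinant(matrix):
    """Exact determinant by Gaussian elimination over Fractions."""
    m = [row[:] for row in matrix]
    size = len(m)
    det = Fraction(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return det


def spanning_trees(n):
    """Yield every spanning tree of the complete graph on n vertices as a tuple of edges."""
    all_edges = list(itertools.combinations(range(n), 2))
    for edges in itertools.combinations(all_edges, n - 1):
        parent = list(range(n))

        def find(a):
            while parent[a] != a:
                a = parent[a]
            return a

        acyclic = True
        for u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:          # adding this edge would close a cycle
                acyclic = False
                break
            parent[ru] = rv
        if acyclic:               # n - 1 acyclic edges on n vertices form a spanning tree
            yield edges


def brute_force_sum(weights, n):
    """Sum over all trees of the product of edge weights."""
    total = Fraction(0)
    for tree in spanning_trees(n):
        prod = Fraction(1)
        for e in tree:
            prod *= weights[e]
        total += prod
    return total


n = 4
weights = {e: Fraction(k + 1, 3)
           for k, e in enumerate(itertools.combinations(range(n), 2))}
z_det = determinant(build_q(weights, n))   # closed form, O(n^3)
z_sum = brute_force_sum(weights, n)        # explicit enumeration, exponential in n
num_trees = sum(1 for _ in spanning_trees(n))
```

The enumeration is only feasible for very small $n$; the determinant route is the one that scales.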

To optimize the Lagrange multipliers, we must compute derivatives of $J(\lambda)$ or, equivalently,
derivatives of the log-partition functions with respect to $\lambda$. It is well known that such derivatives lead
to averages with respect to the distribution in question (for details see Appendix A). In our case,
for example,

\[ \frac{\partial \log Z_s}{\partial \lambda_t} = s y_t \big\langle \log T(X_t \mid E_s, \theta^s) \big\rangle_{P(E_s)} = s y_t \Big[ w_0^s(X_t) + \sum_{uv} w_{uv}^s(X_t)\, W_{uv}^s M_{uv}^s \Big] \tag{37} \]

where $M^s$ is a linear function of $Q^{-1}(W^s)$ given in Appendix A. Inverting the matrix $Q(W)$ is $O(n^3)$
and this operation can be done once before the summations in equation (37). Thus, computing
the derivatives of the normalization constant w.r.t. all $\lambda_t$ takes $O(n^3 + n^2 T)$ operations and $O(n^2)$
extra space.

5 Note that $A$ as a whole is a singular matrix.

Finally, to obtain the decision rule for any new example $X$ we must compute averages of the
log-likelihood ratio with respect to the (marginal) MRE distribution $P(E_1) P(E_{-1})$:

\[ \hat{y} = \mathrm{sgn}\Big\{ \sum_{E_1, E_{-1}} P(E_1) P(E_{-1}) \log \frac{T(X \mid E_1, \theta^1)}{T(X \mid E_{-1}, \theta^{-1})} \Big\} \tag{38} \]

\[ \phantom{\hat{y}} = \mathrm{sgn}\Big\{ w_0^1(X) - w_0^{-1}(X) + \Big\langle \sum_{uv \in E_1} w_{uv}^1(X) \Big\rangle_{P(E_1)} - \Big\langle \sum_{uv \in E_{-1}} w_{uv}^{-1}(X) \Big\rangle_{P(E_{-1})} \Big\} \tag{39} \]

where we have omitted a possible bias term $b$. The required averages can be computed analogously
to Eq. (37), yielding e.g.

\[ \Big\langle \sum_{uv \in E_1} w_{uv}^1(X) \Big\rangle_{P(E_1)} = \sum_{uv} w_{uv}^1(X)\, W_{uv}^1 M_{uv}^1 \tag{40} \]

where $M^1_{uv}$ is the same matrix as in Eq. (37) and has already been computed in the last step of the
training algorithm. Classifying a new data point therefore requires only roughly $O(n^2)$ operations.

3.5 MRE distributions over tree structures and parameters 

Here we describe briefly how to find the MRE distribution over both structures and parameters,
i.e., $P(E_1, \theta^1, E_{-1}, \theta^{-1})$. We assume a factored prior $P_0(\theta^1) P_0(\theta^{-1})$ over the parameters and, as
before, a uniform prior over the structures. In addition to the parameter independence and modularity
assumptions used earlier, we must assume that the priors $P_0(\theta^s)$, $s = 1, -1$, are likelihood equivalent
(i.e. they assign the same value to models having the same likelihood for all data sets). In this case,
the priors over parameters are forced to be Dirichlet [9] and defined in terms of a set of equivalent
marginal counts $N^s_{uv}(x_u, x_v)$ satisfying

\[ \sum_{x_u} N^s_{uv}(x_u, x_v) = N^s_v(x_v), \qquad \sum_{x_v} N^s_{uv}(x_u, x_v) = N^s_u(x_u), \qquad \sum_{x_u, x_v} N^s_{uv}(x_u, x_v) = N^s. \tag{41} \]

Because the prior over parameters is independent of the structure, the MRE distribution factorizes
as

\[ P(E_s, \theta^s) = \frac{1}{Z_s} P_0(\theta^s)\, e^{\sum_t s \lambda_t y_t \log T(X_t \mid E_s, \theta^s)}. \tag{42} \]

To evaluate the partition function $Z_s$, the parameters $\theta^s$ can be analytically integrated out before
the summation over structures. The resulting marginal distribution over tree structures is similar
to equation (35),

\[ P(E_s) = \frac{W_0^s}{Z_s} \prod_{uv \in E_s} W_{uv}^s, \tag{43} \]

where the factors $W^s$ are now functions of both $\lambda$ and the Dirichlet distribution parameters $N^s$ (see
Appendix B for the exact expression).

The classification rule is also similar in form to equation (39) with the terms $w^s$ depending on
$\lambda$, the data, and the equivalent counts as described in Appendix B.

3.6 General Bayes nets 

A Bayes net with given structure can be parametrized by the set of conditional distributions
$P(v \mid \mathrm{pa}(v) = x_{\mathrm{pa}(v)})$ of a variable given a configuration of its parents. A discriminative MRE solution
can be found for the parameter distribution $P(\theta^1, \theta^{-1})$ assuming complete observations. Finding
the MRE distribution over structures is, however, unlikely to be feasible for other than trees (cf.
[5]).

Figure 7: ROC curves for the ME discriminative classifier (full line) and the ML classifier (dashed
line) for the splice junction classification problem. The minimum test errors are 12.4% and 14%
respectively.

[Figure 8: image panels (a) and (b), each labeled "ML" and "Discriminant".]

Figure 8: Logarithmic weights $w_{uv}$ versus mutual informations $I_{uv}$ for class 1 (a) and class $-1$ (b).
The square in position $uv$, $u < v$, represents $w_{uv}$ while its symmetric counterpart $vu$ represents $I_{uv}$.
Larger values appear darker in the figures.

3.7 Experiments 

We tested our model in the fixed parameter version on the detection of DNA splice sites and
compared its performance to the performance of a classifier using a Maximum Likelihood (ML) tree
for each class. In both cases, the tree parameters $\theta$ were the ML parameters for the corresponding
class (empirical class-conditional marginals).

The domain consists of 25 variables representing sites around a (hypothetical) splice junction.
The test set had 400 examples split equally between the two classes; the training set consisted of
4724 examples, about a fourth being positive ones. For simplicity, we used a fixed margin $\gamma = 4$,
the largest value that allowed perfect class separation. The number of $\lambda$'s that are nonzero in this
example is 61 (out of 400), suggesting a performance level of about 15% according to Lemma 1. The
ROC curves for the two classifiers are compared in Figure 7. The MRE distribution over tree structures
is superior to a pair of maximum likelihood trees, although the parameter values are identical. The
test set error is 14.0% for the ML classifier and 12.3% for the MRE method. The training error is
0.5% for the ML classifier and zero for the discriminative one, indicating that the MRE method is
resistant to overfitting.

Figure 8 compares the "edge weights" for the two classifiers. These edge weights reflect the
preferences assigned to tree structures in the MRE distribution or in the (single) class-conditional
maximum likelihood (ML) tree. Since the estimation criterion differs in the two cases, the most
likely tree in the MRE solution does not in general equal the ML tree structure. Figure 8a displays the
$w^1_{uv} = \log(W^1_{uv})$ factors corresponding to each edge $uv$ in the MRE distribution for class 1 as well as
the respective mutual information values $I^1_{uv}$. Since both matrices are symmetric, one can display
both sets of values in a 25 by 25 square: the upper left half represents the ME weights whereas the
lower right half of the square shows the mutual information. Figure 8b shows the same results for
class $-1$. Note that summing $w^1_{uv}$ or $I^1_{uv}$ across the edges of a particular tree pertains directly to the
log-probability of the tree and thus the comparison is meaningful (see footnote 6).

The figure shows that there are relatively few edges with large weights on both sides of the
diagonal. This is particularly relevant for the discriminative model of the positive examples, since
it shows that the MRE distribution decays rapidly around its peak. The maximum $W^1_{uv}$ is more
than $10^3$ times the next largest value, clearly separating edges that are discriminative from those
whose inclusion or exclusion has little effect on discrimination. This contrast is understandably less
pronounced for the negative examples, which represent a diverse collection of spurious splice sites.

A second important remark is that neither Figure 8a nor Figure 8b is symmetric w.r.t. the diagonal. In
other words, not all pairs of variables that exhibit high mutual information are also discriminative. 
Note for example that the subdiagonal band showing that adjacent variables are informative of 
each other is almost completely effaced under discriminative training. Our method brings out the 
discriminative structure of the data, which is different from its structure as a density estimator. 

4 Anomaly detection 

In anomaly detection we are given a set of training examples representing only one class, the 
"typical" examples. We attempt to capture regularities among the examples to be able to recognize 
unlikely members of this class. Estimating a probability distribution $P(X \mid \theta)$ on the basis of the
training set $\{X_1, \ldots, X_T\}$ via the standard maximum likelihood (or analogous) criterion is not
appropriate since there is no reason to further increase the probability of those examples that are
already well captured by the model. A more relevant measure involves the level sets

\[ \mathcal{X}_\gamma = \{ X \in \mathcal{X} : \log P(X \mid \theta) \ge \gamma \}. \tag{44} \]

These level sets are used in deciding the class membership, even in the context of ML parameter
estimation. We therefore estimate the parameters $\theta$ to optimize an appropriate level set. As before,
we cast this problem as MRE:

Definition 3 Given a probability model $P(X \mid \theta)$, $\theta \in \Theta$, a set of training examples $\{X_1, \ldots, X_T\}$,
a set of margin variables $\gamma = [\gamma_1, \ldots, \gamma_T]$, and a prior distribution $P_0(\theta, \gamma)$, we find the MRE
distribution $P(\theta, \gamma)$ that minimizes $D(P \| P_0)$ subject to the constraints

\[ \int P(\theta, \gamma) \left[ \log P(X_t \mid \theta) - \gamma_t \right] d\theta\, d\gamma \ge 0 \tag{45} \]

for all $t = 1, \ldots, T$.

Note that this is again an MRE projection problem whose solution can be obtained as before.
The choice of $P_0(\gamma)$ in $P_0(\theta, \gamma) = P_0(\theta) P_0(\gamma)$ is not as straightforward as before since each margin
$\gamma_t$ needs to be close to achievable log-probabilities. We can nevertheless easily find a reasonable
choice, e.g. by relating the prior mean of $\gamma_t$ to some $\alpha$-percentile of the training set log-probabilities
generated through ML or another standard parameter estimation criterion. Denote the resulting value
by $l_\alpha$ and define the prior $P_0(\gamma_t)$ as $P_0(\gamma_t) = c\, e^{-c (l_\alpha - \gamma_t)}$ for $\gamma_t \le l_\alpha$. In this case the prior mean
of $\gamma_t$ is $l_\alpha - 1/c$.
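
As a rough illustration of this prior, the sketch below picks $l_\alpha$ as a nearest-rank percentile of some training log-probabilities and samples margins from the one-sided exponential density $c\, e^{-c(l_\alpha - \gamma_t)}$. The log-probabilities are synthetic stand-ins, and the values of $\alpha$ and $c$, the percentile convention and the helper names are all assumptions made for the example.

```python
import math
import random


def percentile(values, alpha):
    """Nearest-rank alpha-percentile (alpha in (0, 1]) of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(alpha * len(ordered)))
    return ordered[rank - 1]


def sample_margin(l_alpha, c, rng):
    """Draw gamma_t from P0(gamma_t) = c * exp(-c * (l_alpha - gamma_t)), gamma_t <= l_alpha."""
    # l_alpha - gamma_t is exponentially distributed with rate c.
    return l_alpha - rng.expovariate(c)


rng = random.Random(0)
# Synthetic log-probabilities standing in for the scores of an ML-trained model.
train_log_probs = [rng.gauss(-30.0, 3.0) for _ in range(1000)]

l_alpha = percentile(train_log_probs, 0.05)   # assumed alpha = 5%
c = 2.0                                       # assumed decay rate
draws = [sample_margin(l_alpha, c, rng) for _ in range(200000)]
empirical_mean = sum(draws) / len(draws)      # should be close to l_alpha - 1/c
```

Larger values of `c` concentrate the prior just below $l_\alpha$, while smaller values tolerate margins well below it.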

6 The comparison is done up to a scaling factor and an additive constant.

Figure 9: a) Distribution of training set log-likelihoods for the MRE model (solid line) or the Bayes
model (dashed line). b) ROC curve for the two models on an independent test set.



We have verified experimentally for a simple product distribution that this choice of prior
together with the MRE framework leads to a real improvement over the standard (Bayesian) approach.
Figure 9 illustrates the benefit of the MRE approach for discriminating between true and spurious
splice sites. The examples were fixed length DNA sequences (length 25) and we used the following
product distribution of simple multinomials:

\[ P(X \mid \theta) = \prod_{i=1}^{25} P_i(x_i \mid \theta_i) = \prod_{i=1}^{25} \theta_{x_i \mid i} \tag{46} \]

where $X = \{x_1, \ldots, x_{25}\}$, $x_i \in \{A, C, T, G\}$, and $\sum_{x_i} \theta_{x_i \mid i} = 1$. The model parameters $\{\theta_{x_i \mid i}\}$ were
estimated on the basis of only true examples (7000). The estimation criterion was either Bayesian
with an independent Dirichlet prior over each component distribution $\{\theta_{\cdot \mid i}\}$ or through the relative
entropy projection method with the same prior. Figure 9a) indicates, as expected, that the training
set log-likelihoods from the MRE method are more uniform and without the long tails (see footnote 7). This
difference leads to improved anomaly detection as shown by the ROC curve in Figure 9b). The test
set consisted of 1192 true splice sites and 3532 spurious ones.
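
For concreteness, here is a minimal sketch of the product-of-multinomials model in equation (46), fitted with a Dirichlet posterior mean. This corresponds to the Bayesian baseline only, not to the MRE estimate. The four toy sequences, the pseudo-count of 1 and the function names are assumptions made for illustration.

```python
import math

ALPHABET = "ACTG"


def fit_posterior_mean(sequences, pseudo_count=1.0):
    """Per-position Dirichlet posterior-mean estimate of theta_{x|i}."""
    length = len(sequences[0])
    theta = []
    for i in range(length):
        counts = {a: pseudo_count for a in ALPHABET}
        for seq in sequences:
            counts[seq[i]] += 1
        total = sum(counts.values())
        theta.append({a: counts[a] / total for a in ALPHABET})
    return theta


def log_prob(seq, theta):
    """log P(X | theta) = sum_i log theta_{x_i | i}, as in eq. (46)."""
    return sum(math.log(theta[i][x]) for i, x in enumerate(seq))


train = ["ACGT", "ACGA", "ACTT", "AGGT"]   # toy data, not the splice-site set
theta = fit_posterior_mean(train)
typical = log_prob("ACGT", theta)          # resembles the training sequences
unusual = log_prob("TTAC", theta)          # uses rare letters at every position
```

Thresholding `log_prob` at a margin $\gamma$ gives the level-set membership test of equation (44).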

We expect the effect to be more striking in the context of more sophisticated models such as 
HMMs that may otherwise easily capture spurious regularities in the data. In the next section we 
describe how such models can be used efficiently within the MRE framework. 



4.1 Extension to latent variable models 

In the presence of latent variables (missing information) we can no longer use the above
formulation directly. This arises because $\log P(X_t \mid \theta)$ does not decompose into a sum of simple components.
We can, however, achieve an efficient lower bound solution. If we let $X_h$ be the set of latent variables,
we can resort to the following variational lower bound:

\[ \log P(X_t \mid \theta) \ge \sum_{X_h} Q_t(X_h) \log P(X_t, X_h \mid \theta) + H(Q_t) \tag{47} \]



where H(Q t ) is the entropy of the Q t distribution. A separate transformation has to be introduced 
for each training example. Note that the lower bound is reasonable in this context since the objective 

is to guarantee that all (or most) training examples have likelihoods above some margin threshold.
Whenever the lower bound exceeds the threshold, so does the original likelihood.

7 To compute these log-likelihoods from the MRE method, we used the MRE solution as the posterior distribution
over the parameters. This is suboptimal for the MRE method given that the criterion is slightly different but suffices
here for the purposes of illustration. An analogous figure with minor differences could be computed on the basis of
$\int P(\theta) \log P(X \mid \theta)\, d\theta$ for the two methods. In this case, the figure would be suboptimal for the Bayesian approach.

The MRE distribution $P(\theta, \gamma)$ is obtained under the following constraints:

\[ \int P(\theta, \gamma) \Big[ \sum_{X_h} Q_t(X_h) \log P(X_t, X_h \mid \theta) - \gamma_t \Big] d\theta\, d\gamma + H(Q_t) \ge 0 \tag{48} \]



which are of the same form (linear) as before. Note that we have made an additional assumption
that $Q_t(X_h)$ is functionally independent of the parameters $\theta$. This assumption guarantees that the
MRE distribution $P(\theta, \gamma)$ can be computed efficiently for a large class of probability models such
as mixture models and HMMs. The loss in accuracy due to this simplifying assumption vanishes
whenever the (marginal) MRE distribution $P(\theta)$ becomes peaked. In principle, this means that we
can always find the single most discriminative setting of the parameters even with the variational
bound. Roughly speaking, we incur a loss only relative to the exact MRE approach.
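
The variational bound of equation (47) can be illustrated for a single fixed parameter setting. The sketch below assumes a hypothetical two-component, unit-variance Gaussian mixture in which the latent variable is the component index; the mixture weights, means and function names are made up. It shows that an arbitrary $Q_t$ gives a lower bound on the log-likelihood and that the posterior over the latent variable makes the bound tight.

```python
import math


def log_joint(x, h, mix, means):
    """log P(x, h | theta) for a unit-variance Gaussian mixture (assumed model)."""
    return (math.log(mix[h]) - 0.5 * math.log(2 * math.pi)
            - 0.5 * (x - means[h]) ** 2)


def log_marginal(x, mix, means):
    """Exact log P(x | theta), summing out the latent component."""
    return math.log(sum(math.exp(log_joint(x, h, mix, means))
                        for h in range(len(mix))))


def lower_bound(x, q, mix, means):
    """sum_h Q(h) log P(x, h | theta) + H(Q), the right-hand side of eq. (47)."""
    expected = sum(q[h] * log_joint(x, h, mix, means)
                   for h in range(len(q)) if q[h] > 0)
    entropy = -sum(p * math.log(p) for p in q if p > 0)
    return expected + entropy


def posterior(x, mix, means):
    """P(h | x, theta), the choice of Q that makes the bound tight."""
    joint = [math.exp(log_joint(x, h, mix, means)) for h in range(len(mix))]
    z = sum(joint)
    return [j / z for j in joint]


mix, means, x = [0.3, 0.7], [-1.0, 2.0], 0.5
exact = log_marginal(x, mix, means)
loose = lower_bound(x, [0.5, 0.5], mix, means)               # arbitrary Q
tight = lower_bound(x, posterior(x, mix, means), mix, means)  # optimal Q
```

In the MRE setting the bound is averaged over $P(\theta)$ rather than evaluated at one $\theta$, so this only shows the inequality itself.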

The overall solution to the MRE problem is no longer unique, however, but we can find a locally 
optimal solution iteratively as follows: 

Step 1. Fix $\{Q_t(X_h)\}$ and find the MRE distribution $P(\theta, \gamma)$ as before.

Step 2. Fix $P(\theta, \gamma)$ and let

\[ Q_t(X_h) \propto \exp\Big\{ \int P(\theta) \log P(X_t, X_h \mid \theta)\, d\theta \Big\} \tag{49} \]

Both steps can be computed efficiently for a large class of models such as HMMs assuming the prior
$P_0(\theta)$ is Dirichlet and factorizes across the parameters. More generally, the prior should be the
conjugate prior satisfying the parameter independence assumption of [6] (see also [9]).

The iterative algorithm actually converges in the sense defined by the following theorem: 

Theorem 6 If we let $P^{(n)}(\theta, \gamma)$ be the MRE distribution after $n$ steps of the iterative algorithm
described above, then

\[ D(P^{(1)} \| P_0) \ge D(P^{(2)} \| P_0) \ge \cdots \ge D(P^{(n)} \| P_0). \tag{50} \]

The theorem is easy to understand as follows: each time we optimize any of the $Q_t(X_h)$
distributions, we maximize the associated lower bound. This maximization relaxes the corresponding
constraint on the MRE distribution and allows the relative entropy to be decreased.

5 Uncertain or incompletely labeled examples 

Examples with uncertain labels are hard to deal with in any standard discriminative classification
method, probabilistic or not. Note the difference between labels that are inherently stochastic and
those that are predictable but merely missing (the case considered here). Uncertain labels can be
handled in a principled way within the maximum entropy formalism: let $y = \{y_1, \ldots, y_T\}$ be a set
of binary variables corresponding to the labels for the training examples. We can define a prior
uncertainty over the labels by specifying $P_0(y)$; for simplicity, we can take this to be a product
distribution

\[ P_0(y) = \prod_t P_{t,0}(y_t) \tag{51} \]

where a different level of uncertainty can be assigned to each example. We may, for example,
set $P_{t,0}(y_t) = 1$ whenever $y_t$ is observed and $P_{t,0}(y_t) = 0.5$ if the label is missing. The MRE
solution is found by calculating the relative entropy projection from the overall prior distribution
$P_0(\Theta, \gamma, y) = P_0(\Theta) P_0(\gamma) P_0(y)$ to the admissible set of distributions $\mathcal{P}$ (no longer directly a function
of the labels) that are consistent with the constraints:

\[ \sum_y \int_{\Theta, \gamma} P(\Theta, \gamma, y) \left[ y_t \mathcal{L}(X_t, \Theta) - \gamma_t \right] d\Theta\, d\gamma \ge 0 \tag{52} \]

for all $t = 1, \ldots, T$. The prior distribution $P_0(\gamma)$ in this formulation encourages decision rules that
achieve large classification margins for the examples (most of the probability mass is assigned to
values $\gamma_t > 0$). This preference towards large margins creates dependencies between the (a priori)
unknown labels and the parameters of the discriminant function. Consequently, even unlabeled
examples will contribute to the (marginal) MRE distribution $P(\Theta)$ that specifies the decision rule.
We may alternatively view the MRE formulation as a transduction algorithm [22] whose objective
is to determine the class labels for a set of unlabeled training examples.

While this provides a principled framework for dealing with uncertain or partially labeled
examples, the MRE solution itself is not in general feasible to obtain. For example, in the context
of support vector machines (for an alternative approach see [2]), the MRE distribution over the 
labels will be (roughly speaking) a Boltzmann machine and therefore not manageable in general via 
exact calculations. We can nevertheless employ efficient approximate methods to obtain an iterative 
algorithm for self-consistent probabilistic assignment of the uncertain labels. 

5.1 Feasible approximation 

To be able to deal with uncertain labels in a feasible way, we solve instead the following MRE 
problem with additional constraints: 

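Definition 4 Given a parametric discriminant function $\mathcal{L}(X, \Theta)$, a set of margin variables
$\gamma = [\gamma_1, \ldots, \gamma_T]$, a set of class variables $y = [y_1, \ldots, y_T]$, and a prior distribution

\[ P_0(\Theta, \gamma, y) = P_0(\Theta) \Big[ \prod_t P_0(\gamma_t) \Big] \Big[ \prod_t P_{0,t}(y_t) \Big] \tag{53} \]

we find a constrained MRE distribution $P(\Theta, \gamma, y)$ of the form $P(\Theta, \gamma) P(y)$ that minimizes $D(P \| P_0)$
subject to the constraints

\[ \sum_y \int P(\Theta, \gamma) P(y) \left[ y_t \mathcal{L}(X_t, \Theta) - \gamma_t \right] d\Theta\, d\gamma \ge 0 \tag{54} \]

for all $t = 1, \ldots, T$.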

We may view this as a type of mean field approximation since the MRE distribution is forced
to factorize to make the problem tractable. The solution is no longer unique but can be obtained
through the following two-stage iterative algorithm:

Step 1. Fix $P(y)$ and let $p_t = \sum_y P(y) y_t$. We find $P(\Theta, \gamma)$ as the MRE solution subject to the
constraints

\[ \int P(\Theta, \gamma) \left[ p_t \mathcal{L}(X_t, \Theta) - \gamma_t \right] d\Theta\, d\gamma \ge 0. \tag{55} \]

Note that since the prior factorizes across $\{\Theta, \gamma\}$ the MRE solution factorizes as well, i.e.,
$P(\Theta, \gamma) = P(\Theta) P(\gamma)$.

Step 2. Fix the marginal $P(\Theta)$ obtained in the previous step and find the MRE solution $P'(y, \gamma)$
subject to

\[ \sum_y \int P'(y, \gamma) \Big[ \int P(\Theta) \left[ y_t \mathcal{L}(X_t, \Theta) - \gamma_t \right] d\Theta \Big] d\gamma \ge 0 \tag{56} \]

for all $t$. Update $P(y) \leftarrow (1 - \epsilon) P(y) + \epsilon P'(y)$ or simply set $p_t \leftarrow (1 - \epsilon) p_t + \epsilon p'_t$, where
$p'_t = \sum_y P'(y) y_t$.

The fact that we include $P(\gamma)$ also in the second step is necessary since any adjustments to
the labels must be compensated by an increased margin. The distribution P(y) is updated via 
relaxation to ensure a more controlled adjustment of the labels; any large change in P(y) is likely to 
induce a significant subsequent modification to the solution of the first step. Although the iterative 
algorithm remains stable even if larger changes are made, we believe the relaxation update leads 
to better local optima. Moreover, since the admissible set is convex and because the minimization 
objective (relative entropy) is also convex, the relaxation update always yields a change in the 
appropriate direction. The solution to either step is well defined and can be obtained in closed 
form assuming the problem is tractable when we have complete information about the labels. The 
iterative algorithm is well-behaved in the sense of the following theorem: 

Theorem 7 Let $P^{(n)}(\Theta, \gamma, y) = P^{(n)}(\Theta, \gamma) P^{(n)}(y)$ be the constrained MRE solution after $n$
iterations. Then for all $0 < \epsilon < 1$, where $\epsilon$ is the step size used in the algorithm, we have

\[ D(P^{(1)} \| P_0) \ge D(P^{(2)} \| P_0) \ge \cdots \ge D(P^{(n)} \| P_0). \tag{57} \]

The result holds also after either step of the two-stage iterative algorithm. 

5.2 Example: support vector machines 

Here we provide a preliminary numerical assessment of how the above algorithm is able to make use
of unlabeled examples in the context of predicting DNA splice sites with support vector machines.
A detailed formulation of the algorithm for SVMs can be found in Appendix C. We generated three
training sets of examples corresponding to whether 1) all the labels were known, 2) labels were
provided only for about 10% randomly chosen examples and the remaining 90% were unlabeled but
available, and 3) only the 10% labeled examples were used for training. The full training set in this
case consisted of 500 true DNA splice sites and 500 spurious ones (false examples). The examples
were fixed length (25) strings of DNA letters (A,C,T,G) which were translated into bit vectors using
a four bit encoding (e.g. A $\to$ [1000]). Figure 10 gives ROC curves based on an independent test set
(1192 true examples and 3532 false examples) for SVMs trained with one of the three training sets.
Note that when the training set is fully labeled the algorithm reduces to the standard formulation.
The figures show that even the approximate formulation (see footnote 8) is able to reap most of the benefit from
the unlabeled examples. The finding is also robust against the choice of the kernel function as is
seen by comparing Figure 10a) and 10b). The findings are preliminary.
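
The four bit encoding mentioned above can be sketched as follows. Only the code for A is given in the text; the codes assigned here to C, T and G, and the function names, are assumptions made for illustration.

```python
# One-hot code per letter. Only A -> [1000] is stated in the text;
# the remaining three assignments are assumed.
CODES = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "T": [0, 0, 1, 0],
    "G": [0, 0, 0, 1],
}


def encode(sequence):
    """Concatenate the four bit codes of the letters of a DNA string."""
    bits = []
    for letter in sequence:
        bits.extend(CODES[letter])
    return bits


def linear_kernel(a, b):
    """Inner product of two bit vectors."""
    return sum(x * y for x, y in zip(a, b))


x1 = encode("ACTG")
x2 = encode("ACGG")
matches = linear_kernel(x1, x2)   # number of positions where the strings agree
```

With a one-hot code of this kind, the linear kernel between two encoded strings simply counts matching positions, and a length 25 sequence becomes a 100-dimensional bit vector.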

6 Discussion 

We have presented a general approach to discriminative training of model parameters, structures, 
or parametric discriminant functions. The formalism is based on the minimum relative entropy
principle, reducing all calculations to relative entropy projections. Quite remarkably, we can efficiently

8 In our experiments, $\epsilon = 0.1$ and the iterative algorithm was run for 10 iterations. The benefit may vary as a
function of $\epsilon$ and the number of iterations, particularly if $\epsilon$ is too large. The prior probabilities $P_{0,t}(y_t)$ in
$P_0(y) = \prod_t P_{0,t}(y_t)$ were set to 0 or 1 when the label $y_t$ was observed and to 0.5 for the unlabeled ones.

[Figure 10: ROC plots, panels a) and b); horizontal axis: false positives.]

Figure 10: Test set ROC curves based on a training set with fully labeled examples (solid line),
90% unlabeled and 10% labeled (dot-dashed), and only the 10% labeled examples (dashed). In a) a
linear kernel was used and in b) a Gaussian kernel.

and exactly compute the best discriminative distribution over tree structures within this framework. 
The MRE idea gives, in addition, a natural discriminative formulation of anomaly detection
problems or classification problems involving partially labeled examples. Efficient algorithms were also
given to exploit such formulations.

A Computing averages under a factored distribution over tree structures

Lemma 4 If $P(E)$ is given by equation (34) and $f, g$ are functions of $E$ additive in the edges (i.e.
$f(E) = \sum_{uv \in E} f_{uv}$) then

\[ \langle f(E) \rangle_P = \frac{W_0}{Z} \left. \frac{\partial\, |Q(W e^{\alpha f})|}{\partial \alpha} \right|_{\alpha = 0} \tag{58} \]

\[ \langle f(E) g(E) \rangle_P = \frac{W_0}{Z} \left. \frac{\partial^2 |Q(W e^{\alpha f + \beta g})|}{\partial \alpha\, \partial \beta} \right|_{\alpha = \beta = 0} \tag{59} \]

where $W e^{\alpha f}$ denotes the set of weights $W_{uv} e^{\alpha f_{uv}}$.

This lemma can be easily proved by equating $|Q(W e^{\alpha f})|$ with its definition (36) and then taking
derivatives of both sides. Then, remembering that for any matrix $A$ with elements $A_{ij}$

\[ \frac{\partial |A|}{\partial A_{ij}} = |A|\, (A^{-1})_{ji} \tag{60} \]

one obtains, after conveniently grouping the terms, the result of Lemma 5:

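Lemma 5 Let $P(E)$ and $Q$ be given by equations (34) and (36) respectively, $M$ be a symmetric
matrix with 0 diagonal defined by

\[ M_{uv} = M_{vu} = \begin{cases} (Q^{-1})_{uu} + (Q^{-1})_{vv} - 2 (Q^{-1})_{uv} & u < v < n \\ (Q^{-1})_{vv} & v < u = n \end{cases} \tag{61} \]

and $f$ a function of the structure $E$ satisfying $f(E) = \sum_{uv \in E} f_{uv}$. Then the average of $f$ under $P$
is

\[ \langle f(E) \rangle_P = \sum_E P(E) f(E) = \sum_{1 \le u < v \le n} f_{uv} W_{uv} M_{uv}. \tag{62} \]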
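
Lemma 5 can be checked against explicit enumeration for small $n$. The sketch below assumes that $M_{uv} = (Q^{-1})_{uu} + (Q^{-1})_{vv} - 2(Q^{-1})_{uv}$ for $u, v < n$ and $M_{vn} = (Q^{-1})_{vv}$ (the form obtained by grouping terms after applying equation (60)), and that the sum in the lemma runs over unordered pairs. The weights, the edge function $f$ and all helper names are made up for the example; exact rational arithmetic is used so the two averages can be compared exactly.

```python
import itertools
from fractions import Fraction


def build_q(w, n):
    """Matrix Q(W) of eq. (36); w[(u, v)] is given for 0 <= u < v < n."""
    def get(u, v):
        return w[(min(u, v), max(u, v))]
    return [[sum(get(k, v) for k in range(n) if k != v) if u == v else -get(u, v)
             for v in range(n - 1)] for u in range(n - 1)]


def inverse(matrix):
    """Exact Gauss-Jordan inverse over Fractions (assumes the matrix is invertible)."""
    size = len(matrix)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(size):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def edge_matrix(q_inv, n):
    """Symmetric, zero-diagonal matrix M built from Q^{-1} (assumed form, see lead-in)."""
    m = [[Fraction(0)] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if v == n - 1:        # edge to the vertex dropped from Q
                val = q_inv[u][u]
            else:
                val = q_inv[u][u] + q_inv[v][v] - 2 * q_inv[u][v]
            m[u][v] = m[v][u] = val
    return m


def spanning_trees(n):
    """Yield every spanning tree of the complete graph on n vertices."""
    all_edges = list(itertools.combinations(range(n), 2))
    for edges in itertools.combinations(all_edges, n - 1):
        parent = list(range(n))

        def find(a):
            while parent[a] != a:
                a = parent[a]
            return a

        ok = True
        for u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                ok = False
                break
            parent[ru] = rv
        if ok:
            yield edges


n = 4
pairs = list(itertools.combinations(range(n), 2))
w = {e: Fraction(k + 1, 2) for k, e in enumerate(pairs)}    # made-up edge weights
f = {e: Fraction(3 * k - 4) for k, e in enumerate(pairs)}   # made-up edge function

m = edge_matrix(inverse(build_q(w, n)), n)
avg_lemma = sum(f[(u, v)] * w[(u, v)] * m[u][v] for u, v in pairs)

# Brute force: average f(E) under P(E) proportional to the product of edge weights.
z = Fraction(0)
total = Fraction(0)
for tree in spanning_trees(n):
    p = Fraction(1)
    for e in tree:
        p *= w[e]
    z += p
    total += p * sum(f[e] for e in tree)
avg_brute = total / z
edge_mass = sum(w[(u, v)] * m[u][v] for u, v in pairs)  # expected number of edges
```

The quantity $W_{uv} M_{uv}$ is the probability that edge $uv$ appears in a tree drawn from $P(E)$, so these values sum to $n - 1$.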



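B Integrating over the parameters $P(E_s, \theta^s)$

Let us define

\[ N'^s_{uv}(x_u, x_v) = \sum_{t:\, x^t_u = x_u,\, x^t_v = x_v} s \lambda_t y_t, \qquad N'^s_v(x_v) = \sum_{t:\, x^t_v = x_v} s \lambda_t y_t \tag{63} \]

\[ \kappa^s_{uv} = \prod_{x_u} \prod_{x_v} \frac{\Gamma\big(N^s_{uv}(x_u, x_v) + N'^s_{uv}(x_u, x_v)\big)}{\Gamma\big(N^s_{uv}(x_u, x_v)\big)} \tag{64} \]

\[ \kappa^s_v = \prod_{x_v} \frac{\Gamma\big(N^s_v(x_v) + N'^s_v(x_v)\big)}{\Gamma\big(N^s_v(x_v)\big)} \tag{65} \]

where $x^t_u$ is the value of variable $u$ in example $X_t$ and $N'^s = \sum_t s \lambda_t y_t$ is the total of the data-dependent
counts. With these notations we can express $W^s_{uv}$ and $W^s_0$ in equation (43) as

\[ W^s_{uv} = \frac{\kappa^s_{uv}}{\kappa^s_u \kappa^s_v} \quad \text{and} \quad W^s_0 = \frac{\Gamma(N^s)}{\Gamma(N^s + N'^s)} \prod_{v \in V} \kappa^s_v. \tag{66} \]

In the above, $\Gamma()$ denotes Euler's Gamma function. Note that the "counts" $N'^s_{uv}$ can be either
positive or negative, so that the variables $\kappa$ may not be defined for arbitrary values of $\lambda$. All the
above expressions exist, however, for $\lambda = 0$; in this case $W^s_{uv} = W^s_0 = 1$.

The classification rule is given by equation (39) with $w^s_{uv}(X)$, $w^s_0(X)$ redefined as

\[ w^s_{uv}(X) = \Psi[N^s_{uv}(x_u, x_v) + N'^s_{uv}(x_u, x_v)] - \Psi[N^s_v(x_v) + N'^s_v(x_v)] - \Psi[N^s_u(x_u) + N'^s_u(x_u)] \tag{67} \]

\[ w^s_0(X) = \sum_{v \in V} \Psi[N^s_v(x_v) + N'^s_v(x_v)] - \Psi[N^s + N'^s] \tag{68} \]

with $\Psi$ representing the derivative of the log-Gamma function:

\[ \Psi(z) = \frac{d}{dz} \log \Gamma(z). \tag{69} \]

Note the similarity with the fixed parameter case: the classification rule is still an average of a
log-likelihood difference; the $\Psi$ functions arise from averaging the log-likelihood under the MRE
distribution of the $\theta$ parameters.

C Uncertain labels and support vector machines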

We provide here more details about the two step feasible algorithm for dealing with partially
labeled examples in the context of support vector machines. We start by defining the prior distribution
over all the parameters as

\[ P_0(\theta, b, \gamma, y) = P_0(\theta) P_0(b) P_0(\gamma) P_0(y) \tag{70} \]

where $P_0(\theta)$ is $\mathcal{N}(0, I)$ and $P_0(b)$ approaches a non-informative prior. By the non-informative prior
we mean here a limit of $P_0(b \mid k) = \mathcal{N}(0, 1 \cdot k)$ as $k \to \infty$. The prior over the labels is assumed to
factorize across the examples, i.e.,

\[ P_0(y) = \prod_t P_{0,t}(y_t) \tag{71} \]

where, for example, we can set each $P_{0,t}(y_t) = 1$ whenever the corresponding label $y_t$ is known and
$P_{0,t}(y_t) = 0.5$, $y_t = \pm 1$, for all unlabeled examples. We use here $P_0(\gamma)$ from eq. (9); the alternatives
were discussed in the text.

Let now $p_t = \sum_y P_0(y) y_t = \sum_{y_t} P_{0,t}(y_t) y_t$, where $p_t$ is the mean value of the label. With these
initializations, the two step algorithm is given as follows:

Step 1. We fix $\{p_t\}$ and find the MRE solution for $P(\theta, b, \gamma)$. Based on Lemma 3, $P(\theta, \gamma)$ and $P(b)$
can be found separately. For $P(\theta, \gamma)$ the Lagrange multipliers are obtained by maximizing
(analogously to Theorem 2):

\[ J_{\theta,\gamma}(\lambda) = \sum_t \left[ \lambda_t + \log(1 - \lambda_t / c) \right] - \frac{1}{2} \sum_{t, t'} \lambda_t \lambda_{t'}\, p_t p_{t'}\, (X_t^T X_{t'}) \tag{72} \]

subject to the constraint that $\sum_t \lambda_t p_t = 0$. This is no more difficult to solve than the original
SVM optimization problem with hard labels.

As for the bias term $b$, we only need its mean relative to the MRE solution, i.e., $\bar{b} = \int P(b)\, b\, db$.
This can be computed as the limit of the means corresponding to proper priors $P_0(b \mid k)$ (each
MRE solution $P(b \mid k)$ based on $P_0(b \mid k)$ is a Gaussian with a well-defined mean). We omit the
algebra and instead provide the answer in terms of the following averages:

\[ L_t = \int P(\theta) (\theta^T X_t)\, d\theta = \sum_{t'} \lambda_{t'} p_{t'} (X_{t'}^T X_t) \tag{73} \]

\[ \bar{\gamma}_t = \int P(\gamma) \gamma_t\, d\gamma = 1 - \frac{1}{c - \lambda_t} \tag{74} \]

The desired mean $\bar{b}$ is now given by

\[ \bar{b} = \arg\max_b \Big\{ \min_t \big( p_t (L_t + b) - \bar{\gamma}_t \big) \Big\}. \tag{75} \]

This setting optimizes the most critical constraints of eq. (55). In other words, $\bar{b}$ maximizes
the minimum of the left hand sides of eq. (55).

Step 2. To update the MRE distribution over the labels, we fix $P(\theta, b)$ and find $P'(y, \gamma)$ subject to

\[ \sum_y \int P'(y, \gamma) \Big[ \int_{\theta, b} P(\theta, b) \left[ y_t (\theta^T X_t + b) - \gamma_t \right] d\theta\, db \Big] d\gamma
= \sum_y \int P'(y, \gamma) \left[ y_t (L_t + \bar{b}) - \gamma_t \right] d\gamma \ge 0. \tag{76} \]

Analogously to the first step, the Lagrange multipliers are found by maximizing the
corresponding $-\log Z$ (algebra omitted):

\[ J_{y,\gamma}(\lambda') = \sum_t \Big\{ \lambda'_t + \log(1 - \lambda'_t / c) - \log \sum_{y_t = \pm 1} P_{0,t}(y_t)\, e^{y_t \lambda'_t (L_t + \bar{b})} \Big\}. \tag{77} \]

Note that the Lagrange multipliers here are not tied and can be optimized independently for
each $t$. This happens because we have assumed that the prior distribution factorizes across the
examples and because the discriminant function does not tie the variables together. Each of
the one-dimensional convex optimization problems is readily solved by any standard method
(e.g. Newton-Raphson). The resulting MRE distribution over the labels, $P'(y)$, is given by

\[ P'(y) = \prod_t P'_t(y_t) \tag{78} \]

where

\[ P'_t(y_t) = \frac{1}{Z_t} P_{0,t}(y_t)\, e^{y_t \lambda'_t (L_t + \bar{b})}. \tag{79} \]

We can easily compute $p'_t = \sum_{y_t} P'_t(y_t) y_t$ from this result. Finally, the updates

\[ p_t \leftarrow (1 - \epsilon) p_t + \epsilon p'_t \tag{80} \]

complete the second step.
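
The label update of equations (79) and (80) is simple enough to write out directly. In the sketch below the multiplier $\lambda'_t$ and the score $L_t + \bar{b}$ are supplied as plain numbers rather than obtained from the optimizations described above, and the function names and example values are assumptions made for illustration.

```python
import math


def label_posterior(prior_plus, lam, score):
    """P'_t(y_t = +1) from eq. (79); `score` stands for L_t + b."""
    plus = prior_plus * math.exp(lam * score)
    minus = (1.0 - prior_plus) * math.exp(-lam * score)
    return plus / (plus + minus)


def relaxed_update(p_t, prior_plus, lam, score, eps=0.1):
    """One application of eq. (80): p_t <- (1 - eps) * p_t + eps * p'_t."""
    q_plus = label_posterior(prior_plus, lam, score)
    p_new = q_plus - (1.0 - q_plus)      # mean label under P'_t
    return (1.0 - eps) * p_t + eps * p_new


# Unlabeled example: prior 0.5 on each label, current mean label 0.
p_unlabeled = relaxed_update(0.0, 0.5, lam=0.8, score=1.5)

# Observed positive label: prior 1 on y = +1 keeps the mean label at +1,
# whatever the current score is.
p_labeled = relaxed_update(1.0, 1.0, lam=0.8, score=-2.0)
```

For an unlabeled example with a symmetric prior the new mean label reduces to $\tanh(\lambda'_t (L_t + \bar{b}))$, so the step moves $p_t$ a fraction $\epsilon$ of the way towards that value.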
The decision rule for a new example $X$ is given by

\[ \hat{y} = \mathrm{sign}\Big( \sum_t \lambda_t p_t (X_t^T X) + \bar{b} \Big) \tag{81} \]

where $\{\lambda_t\}$ and $\bar{b}$ are the solutions to the first step of the iterative algorithm.






